As Rida asked, we first discuss the project motivation and our experiment, and then begin our analysis.
We chose this dataset from the UCI Machine Learning Repository for its categorical predictive attributes. It contains 1994 Census data pulled from the US Census database. Our prediction task is to predict whether a person's 1994 salary exceeds $50K, based on the various categorical and numerical attributes in the census database. The link to the data source is below:
https://archive.ics.uci.edu/ml/datasets/census+income
The effectiveness of our algorithm will be judged by support, confidence, and lift, as these are the metrics that describe how strongly each element is associated with the other elements within each transaction. Currently there is no established method for cross-validating association rules, although some hard-working individuals out there are attempting to create such a tool.
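These measures can be computed directly from co-occurrence counts. As a quick illustration on toy data (not part of the census analysis): for a rule X => Y over n transactions, support = count(X and Y)/n, confidence = count(X and Y)/count(X), and lift = confidence / (count(Y)/n).

```r
# Toy transactions: each is a set of items
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("milk", "butter"),
  c("bread", "milk", "butter"),
  c("milk")
)
n <- length(transactions)

contains <- function(t, items) all(items %in% t)

# Rule under evaluation: {bread} => {milk}
count_xy <- sum(sapply(transactions, contains, items = c("bread", "milk")))
count_x  <- sum(sapply(transactions, contains, items = "bread"))
count_y  <- sum(sapply(transactions, contains, items = "milk"))

support    <- count_xy / n               # 2/5 = 0.4
confidence <- count_xy / count_x         # 2/3
lift       <- confidence / (count_y / n) # (2/3) / (4/5) = 5/6
```

A lift above 1 means the antecedent and consequent co-occur more often than independence would predict; here the lift of 5/6 says bread buyers are slightly less likely than average to buy milk.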
Here we will discuss each attribute and describe its range of values.
First, let's load the necessary libraries:
Loaded Packages:
arules, arulesViz, forcats, dplyr, plotly, data.table,
pander, knitr, skimr, lubridate, ggplot2, cowplot,
foreach, doParallel
Next, let's import our dataset:
data <- read.csv("./data/adult-training.csv")
The first thing we must do is check whether there are any NAs in our dataset, so they do not distort our analysis.
NA_sum <- sort(sapply(data, function(x) sum(is.na(x))), decreasing = TRUE)
data.frame((NA_sum))
Looks like we are doing OK here. The next issue is that, because of the way the CSV was stored, some of the levels in our factors include leading and trailing whitespace. This is highly undesirable, so we must clean it up:
GetFactors <- function(df) {
    return(names(Filter(is.factor, df)))
}
FixLevels <- function(x) {
    levels(x) <- trimws(levels(x))
    return(x)
}
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
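To see what FixLevels accomplishes, here is a minimal standalone example with a factor whose levels carry stray whitespace (toy values, not from the census data):

```r
# Levels read in with leading/trailing whitespace
f <- factor(c(" Private", "State-gov ", " Private"))

# Trimming the levels fixes every element at once,
# without touching the underlying integer codes
levels(f) <- trimws(levels(f))

sort(levels(f))
```

Cleaning the levels rather than the raw strings is cheap: each distinct level is trimmed once, instead of once per row.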
Next, we need to re-encode our data as factors. First, let's collapse the education levels into larger groups (for example, 1st through 12th grade should be a single No Diploma level, not a bunch of separate levels):
data$education <- fct_collapse(data$education, `No Diploma` = c("1st-4th", "5th-6th",
"7th-8th", "9th", "10th", "11th", "12th", "Preschool"), Associates = c("Assoc-acdm",
"Assoc-voc"), Diploma = c("Some-college", "HS-grad"))
Then the income brackets:
data$income_bracket <- fct_collapse(data$income_bracket, small = "<=50K", large = ">50K")
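fct_collapse simply maps several old levels onto one new level. The same mechanics can be sketched in base R by assigning a named list to levels(); this is only an illustration, the forcats calls above are what we actually use:

```r
# Toy education factor (a subset of the real levels)
e <- factor(c("9th", "HS-grad", "Some-college", "10th"))

# Each list name becomes a new level covering the old levels it names
levels(e) <- list(`No Diploma` = c("9th", "10th"),
                  Diploma      = c("HS-grad", "Some-college"))

as.character(e)
```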
Next, let's change the "?" levels to something more useful:
levels(data$workclass)[levels(data$workclass) == "?"] <- "Other"
levels(data$occupation)[levels(data$occupation) == "?"] <- "Other-service"
levels(data$native_country)[levels(data$native_country) == "?"] <- "Other"
Next, let's remove the fnlwgt, education number, and capital gain and loss columns, as they are not needed for this analysis. We also rename some columns for convenience, and use the cut function to discretize our numeric variables:
data <- data[, -c(3, 5, 11:12)]  # drop fnlwgt, education_num, capital_gain, capital_loss
data$age <- cut(data$age, breaks = c(15, 25, 45, 65, 100), labels = c("Young",
"Middleaged", "Senior", "Retired"))
data$hours_per_week <- cut(data$hours_per_week, breaks = c(0, 20, 40, 60, 80),
labels = c("part-time", "full-time", "hard-working", "need-a-life"))
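cut() bins each value into the interval it falls in; by default the intervals are open on the left and closed on the right, so an age of exactly 25 still lands in the Young bin, while values outside the outermost breaks become NA (which is presumably where the 208 missing hours_per_week values in the summary below come from: people reporting more than 80 hours). A quick check on toy ages:

```r
ages <- c(18, 25, 26, 50, 70, 110)
bins <- cut(ages, breaks = c(15, 25, 45, 65, 100),
            labels = c("Young", "Middleaged", "Senior", "Retired"))

# 25 falls in (15, 25] -> Young; 110 is outside every interval -> NA
as.character(bins)
```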
str(data)
#> 'data.frame': 32561 obs. of 11 variables:
#> $ age : Factor w/ 4 levels "Young","Middleaged",..: 2 3 2 3 2 2 3 3 2 2 ...
#> $ workclass : Factor w/ 9 levels "Other","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#> $ education : Factor w/ 7 levels "No Diploma","Associates",..: 3 3 5 1 3 6 1 5 6 3 ...
#> $ marital_status: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#> $ occupation : Factor w/ 14 levels "Other-service",..: 2 5 7 7 10 5 1 5 10 5 ...
#> $ relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#> $ race : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#> $ gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 1 1 1 2 1 2 ...
#> $ hours_per_week: Factor w/ 4 levels "part-time","full-time",..: 2 1 2 2 2 2 1 3 3 2 ...
#> $ native_country: Factor w/ 42 levels "Other","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#> $ income_bracket: Factor w/ 2 levels "small","large": 1 1 1 1 1 1 1 2 2 2 ...
Let's see the results:
levels(data$workclass)
#> [1] "Other" "Federal-gov" "Local-gov"
#> [4] "Never-worked" "Private" "Self-emp-inc"
#> [7] "Self-emp-not-inc" "State-gov" "Without-pay"
pander(summary(data))
| age | workclass | education |
|---|---|---|
| Young : 6411 | Private :22696 | No Diploma : 4253 |
| Middleaged:16523 | Self-emp-not-inc: 2541 | Associates : 2449 |
| Senior : 8469 | Local-gov : 2093 | Bachelors : 5355 |
| Retired : 1158 | Other : 1836 | Doctorate : 413 |
| NA | State-gov : 1298 | Diploma :17792 |
| NA | Self-emp-inc : 1116 | Masters : 1723 |
| NA | (Other) : 981 | Prof-school: 576 |
| marital_status | occupation | relationship |
|---|---|---|
| Divorced : 4443 | Other-service :5138 | Husband :13193 |
| Married-AF-spouse : 23 | Prof-specialty :4140 | Not-in-family : 8305 |
| Married-civ-spouse :14976 | Craft-repair :4099 | Other-relative: 981 |
| Married-spouse-absent: 418 | Exec-managerial:4066 | Own-child : 5068 |
| Never-married :10683 | Adm-clerical :3770 | Unmarried : 3446 |
| Separated : 1025 | Sales :3650 | Wife : 1568 |
| Widowed : 993 | (Other) :7698 | NA |
| race | gender | hours_per_week |
|---|---|---|
| Amer-Indian-Eskimo: 311 | Female:10771 | part-time : 2928 |
| Asian-Pac-Islander: 1039 | Male :21790 | full-time :20052 |
| Black : 3124 | NA | hard-working: 8471 |
| Other : 271 | NA | need-a-life : 902 |
| White :27816 | NA | NA’s : 208 |
| NA | NA | NA |
| NA | NA | NA |
| native_country | income_bracket |
|---|---|
| United-States:29170 | small:24720 |
| Mexico : 643 | large: 7841 |
| Other : 583 | NA |
| Philippines : 198 | NA |
| Germany : 137 | NA |
| Canada : 121 | NA |
| (Other) : 1709 | NA |
We’d also like to get a quick feel for the dataset through some visualizations.
p1 <- ggplot(data, aes(x = age, color = income_bracket, fill = income_bracket)) +
geom_density(alpha = 0.9) + labs(x = "Age", y = "Density", title = "Age Density by Income",
subtitle = "Density plot")
p2 <- ggplot(data, aes(x = education, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Education",
y = "Proportion", title = "Income bias based on Education", subtitle = "Stacked bar plot")
p3 <- ggplot(data, aes(x = marital_status, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Marital Status",
y = "Proportion", title = "Income bias based on Marital status", subtitle = "Stacked bar plot")
p4 <- ggplot(data, aes(x = occupation, fill = income_bracket, color = income_bracket)) +
geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Occupation Status",
y = "Proportion", title = "Income bias based on Occupation status", subtitle = "Stacked bar plot")
p5 <- ggplot(data, aes(x = hours_per_week, fill = income_bracket, color = income_bracket)) +
    geom_bar(alpha = 0.9, position = "fill") + labs(x = "Hours per week",
    y = "Proportion", title = "Hours per week by Income", subtitle = "Stacked bar plot")
p6 <- ggplot(data, aes(occupation)) + geom_bar(aes(fill = education), width = 0.5) +
theme(axis.text.x = element_text(angle = 60, vjust = 0.5)) + labs(title = "Histogram of occupation with education binning",
subtitle = "Occupation and Educational")
p1
p2
p3
p4
p5
p6
Finally, we can convert our dataset into the transaction format required by the Apriori algorithm:
data <- as(data, "transactions")
summary(data)
#> transactions as itemMatrix in sparse format with
#> 32561 rows (elements/itemsets/transactions) and
#> 102 columns (items) and a density of 0.1077805
#>
#> most frequent items:
#> native_country=United-States race=White
#> 29170 27816
#> income_bracket=small workclass=Private
#> 24720 22696
#> gender=Male (Other)
#> 21790 231771
#>
#> element (itemset/transaction) length distribution:
#> sizes
#> 10 11
#> 208 32353
#>
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 10.00 11.00 11.00 10.99 11.00 11.00
#>
#> includes extended item information - examples:
#> labels variables levels
#> 1 age=Young age Young
#> 2 age=Middleaged age Middleaged
#> 3 age=Senior age Senior
#>
#> includes extended transaction information - examples:
#> transactionID
#> 1 1
#> 2 2
#> 3 3
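The reported density is consistent with the transaction length distribution: 208 transactions carry 10 items (those with a missing hours_per_week) and 32,353 carry 11, spread over 102 item columns. A quick sanity check using only the numbers from the summary above:

```r
n_ones  <- 208 * 10 + 32353 * 11    # total items across all transactions
density <- n_ones / (32561 * 102)   # ones / (rows * columns)
round(density, 7)
#> [1] 0.1077805
```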
Option B: Association Rule Mining

- Create frequent itemsets and association rules.
- Use tables/visualizations to discuss the found results.
- Use several measures for evaluating how interesting different rules are.
- Describe your results. What findings are the most compelling and why?
Before we begin our analysis, let's check the item frequencies within the dataset, plotting items with support >= 0.2:
itemFrequencyPlot(data, support = 0.2)
Next, let's mine some rules with the Apriori algorithm, and then clean up redundant rules. For this first pass we use a generous minimum confidence (support >= 0.2, confidence >= 0.15) and restrict the right-hand side to the income bracket:
zerules <- apriori(data, parameter = list(minlen = 2, supp = 0.2, conf = 0.15),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
length(zerules)
#> [1] 91
redundant <- is.redundant(zerules)
zerules.pruned <- zerules[redundant == FALSE]
rulesorted <- sort(zerules.pruned, by = "lift", decreasing = TRUE)
length(rulesorted)
#> [1] 25
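For context, is.redundant flags a rule as redundant when a more general rule (one whose left-hand side is a proper subset of it, with the same right-hand side) already achieves at least the same confidence. A minimal base-R sketch of that check, on hypothetical rules rather than the actual arules internals:

```r
# Hypothetical rules: lhs itemset, rhs item, confidence
rules <- list(
  list(lhs = c("Married"),         rhs = "large", conf = 0.45),
  list(lhs = c("Married", "Male"), rhs = "large", conf = 0.44),  # no gain over rule 1
  list(lhs = c("Married", "Exec"), rhs = "large", conf = 0.60)   # genuine improvement
)

is_redundant <- function(i, rules) {
  r <- rules[[i]]
  for (g in rules[-i]) {
    # A strictly more general rule with the same rhs and >= confidence
    if (g$rhs == r$rhs &&
        length(g$lhs) < length(r$lhs) &&
        all(g$lhs %in% r$lhs) &&
        g$conf >= r$conf) return(TRUE)
  }
  FALSE
}

flags <- sapply(seq_along(rules), is_redundant, rules = rules)
flags
#> [1] FALSE  TRUE FALSE
```

Adding "Male" to the antecedent did not improve confidence, so that rule tells us nothing new and is pruned; adding "Exec" did, so it survives.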
Next, let us inspect the rules and examine their quality:
(quality(rulesorted))
inspectDT(rulesorted)
First, let's view a scatterplot of our rules:
plot(rulesorted, method = "scatterplot", measure = c("confidence", "support"),
shading = "lift", engine = "htmlwidget")
Next, let's look at a graph-based plot:
plot(rulesorted, method = "graph", measure = "confidence", shading = "lift",
engine = "htmlwidget")
A parallel coordinates plot:
plot(rulesorted, method = "paracoord", measure = "confidence", shading = "lift",
control = list(reorder = T))
A two-key plot:
plot(rulesorted, method = "two-key plot", measure = "confidence", shading = "lift",
engine = "htmlwidget")
And a grouped matrix plot:
plot(rulesorted, method = "grouped", measure = "confidence", shading = "lift")
Now let's mine a second, stricter set of rules (support >= 0.1, confidence >= 0.9) and prune it the same way:
rule2 <- apriori(data, parameter = list(minlen = 2, supp = 0.1, conf = 0.9),
appearance = list(rhs = c("income_bracket=small", "income_bracket=large"),
default = "lhs"), control = list(verbose = F))
length(rule2)
#> [1] 128
redundant <- is.redundant(rule2)
rulep <- rule2[redundant == FALSE]
rulesorted2 <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted2)
#> [1] 56
head(quality(rulesorted2))
inspectDT(rulesorted2)
plot(rulesorted2, method = "scatterplot", measure = c("confidence", "support"),
shading = "lift", engine = "htmlwidget")
plot(rulesorted2, method = "graph", measure = "confidence", shading = "lift",
engine = "htmlwidget")
plot(rulesorted2, method = "two-key plot", measure = "confidence", shading = "lift",
engine = "htmlwidget")
plot(rulesorted2, method = "grouped", measure = "confidence", shading = "lift")
Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?

- How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
- How would you deploy your model for interested parties?
- What other data should be collected?
- How often would the model need to be updated, etc.?